Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 147% (1.47x) speedup for all_columns_match in datacompy/fugue.py

⏱️ Runtime : 1.22 milliseconds 496 microseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 146% speedup through two key optimizations that eliminate redundant computations:

1. Optimized unq_columns function:

  • Original: Created two OrderedSet objects and used set subtraction: OrderedSet(col1) - OrderedSet(col2)
  • Optimized: Creates only one set(col2) and uses list comprehension with membership testing: OrderedSet(c for c in col1 if c not in col2_set)
  • Why faster: Set membership testing (c not in col2_set) is O(1) on average vs. the overhead of creating multiple OrderedSet objects and performing set arithmetic

2. Completely reimplemented all_columns_match function:

  • Original: Called unq_columns() twice, effectively calling fa.get_column_names() four times total and performing complex OrderedSet operations
  • Optimized: Calls fa.get_column_names() only twice (once per dataframe) and directly compares set(col1) == set(col2)
  • Why faster: The line profiler shows fa.get_column_names() is expensive (~10ms per call). Reducing from 4 calls to 2 calls plus using simple set equality eliminates the computational overhead of OrderedSet operations entirely.

Performance impact: The profiler data shows the original all_columns_match spent 100% of its time calling unq_columns, which in turn spent 99.8% of its time in fa.get_column_names(). The optimized version eliminates half of these expensive calls and replaces complex OrderedSet arithmetic with fast set operations.

This optimization is particularly beneficial for workloads that frequently check column matching between dataframes, as it reduces both the computational complexity and the number of expensive external API calls.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 42 Passed
🌀 Generated Regression Tests 🔘 None Found
⏪ Replay Tests 10 Passed
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
⚙️ Existing Unit Tests and Runtime
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
test_fugue/test_duckdb.py::test_all_columns_match_duckdb 159μs 62.8μs 154%✅
test_fugue/test_fugue_pandas.py::test_all_columns_match_native 181μs 70.4μs 158%✅
test_fugue/test_fugue_polars.py::test_all_columns_match_polars 215μs 91.7μs 135%✅
test_fugue/test_fugue_spark.py::test_all_columns_match_spark 213μs 66.5μs 221%✅
⏪ Replay Tests and Runtime
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_fugue_all_columns_match 453μs 204μs 122%✅

To edit these changes git checkout codeflash/optimize-all_columns_match-mi6hq9u0 and push.

Codeflash Static Badge

The optimized code achieves a **146% speedup** through two key optimizations that eliminate redundant computations:

**1. Optimized `unq_columns` function:**
- **Original**: Created two `OrderedSet` objects and used set subtraction: `OrderedSet(col1) - OrderedSet(col2)`
- **Optimized**: Creates only one `set(col2)` and uses list comprehension with membership testing: `OrderedSet(c for c in col1 if c not in col2_set)`
- **Why faster**: Set membership testing (`c not in col2_set`) is O(1) on average vs. the overhead of creating multiple OrderedSet objects and performing set arithmetic

**2. Completely reimplemented `all_columns_match` function:**
- **Original**: Called `unq_columns()` twice, effectively calling `fa.get_column_names()` four times total and performing complex OrderedSet operations
- **Optimized**: Calls `fa.get_column_names()` only twice (once per dataframe) and directly compares `set(col1) == set(col2)`
- **Why faster**: The line profiler shows `fa.get_column_names()` is expensive (~10ms per call). Reducing from 4 calls to 2 calls plus using simple set equality eliminates the computational overhead of OrderedSet operations entirely.

**Performance impact**: The profiler data shows the original `all_columns_match` spent 100% of its time calling `unq_columns`, which in turn spent 99.8% of its time in `fa.get_column_names()`. The optimized version eliminates half of these expensive calls and replaces complex OrderedSet arithmetic with fast set operations.

This optimization is particularly beneficial for workloads that frequently check column matching between dataframes, as it reduces both the computational complexity and the number of expensive external API calls.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 21:04
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant